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Abstract 


It is estimated that over 90% of all new information produced 
in the world is being stored on magnetic media, most of it on 
hard disk drives. Despite their importance, there is relatively 
little published work on the failure patterns of disk drives, and 
the key factors that affect their lifetime. Most available data 
are either based on extrapolation from accelerated aging exper- 
iments or from relatively modest sized field studies. Moreover, 
larger population studies rarely have the infrastructure in place 
to collect health signals from components in operation, which 
is critical information for detailed failure analysis. 

We present data collected from detailed observations of a 
large disk drive population in a production Internet services de- 
ployment. The population observed is many times larger than 
that of previous studies. In addition to presenting failure statis- 
tics, we analyze the correlation between failures and several 
parameters generally believed to impact longevity. 

Our analysis identifies several parameters from the drive’s 
self monitoring facility (SMART) that correlate highly with 
failures. Despite this high correlation, we conclude that mod- 
els based on SMART parameters alone are unlikely to be useful 
for predicting individual drive failures. Surprisingly, we found 
that temperature and activity levels were much less correlated 
with drive failures than previously reported. 


1 Introduction 


The tremendous advances in low-cost, high-capacity 
magnetic disk drives have been among the key factors 
helping establish a modern society that is deeply reliant 
on information technology. High-volume, consumer- 
grade disk drives have become such a successful prod- 
uct that their deployments range from home computers 
and appliances to large-scale server farms. In 2002, for 
example, it was estimated that over 90% of all new in- 
formation produced was stored on magnetic media, most 
of it being hard disk drives [12]. It is therefore critical 
to improve our understanding of how robust these com- 
ponents are and what main factors are associated with 
failures. Such understanding can be particularly useful 


for guiding the design of storage systems as well as de- 
vising deployment and maintenance strategies. 


Despite the importance of the subject, there are very 
few published studies on failure characteristics of disk 
drives. Most of the available information comes from 
the disk manufacturers themselves [2]. Their data are 
typically based on extrapolation from accelerated life 
test data of small populations or from returned unit 
databases. Accelerated life tests, although useful in pro- 
viding insight into how some environmental factors can 
affect disk drive lifetime, have been known to be poor 
predictors of actual failure rates as seen by customers 
in the field [7]. Statistics from returned units are typi- 
cally based on much larger populations, but since there 
is little or no visibility into the deployment characteris- 
tics, the analysis lacks valuable insight into what actu- 
ally happened to the drive during operation. In addition, 
since units are typically returned during the warranty pe- 
riod (often three years or less), manufacturers’ databases 
may not be as helpful for the study of long-term effects. 


A few recent studies have shed some light on field 
failure behavior of disk drives [6, 7, 9, 16, 17, 19, 20]. 
However, these studies have either reported on relatively 
modest populations or did not monitor the disks closely 
enough during deployment to provide insights into the 
factors that might be associated with failures. 


Disk drives are generally very reliable but they are 
also very complex components. This combination 
means that although they fail rarely, when they do fail, 
the possible causes of failure can be numerous. As a 
result, detailed studies of very large populations are the 
only way to collect enough failure statistics to enable 
meaningful conclusions. In this paper we present one 
such study by examining the population of hard drives 
under deployment within Google’s computing infras- 
tructure. 


We have built an infrastructure that collects vital in- 
formation about all Google’s systems every few min- 
utes, and a repository that stores these data in time- 
series format (essentially forever) for further analysis. 


The information collected includes environmental fac- 
tors (such as temperatures), activity levels and many of 
the Self-Monitoring Analysis and Reporting Technology 
(SMART) parameters that are believed to be good indi- 
cators of disk drive health. We mine through these data 
and attempt to find evidence that corroborates or con- 
tradicts many of the commonly held beliefs about how 
various factors can affect disk drive lifetime. 


Our paper is unique in that it is based on data from a 
disk population size that is typically only available from 
vendor warranty databases, but has the depth of deploy- 
ment visibility and detailed lifetime follow-up that only 
an end-user study can provide. Our key findings are: 


e Contrary to previously reported results, we found 
very little correlation between failure rates and ei- 
ther elevated temperature or activity levels. 


e Some SMART parameters (scan errors, realloca- 
tion counts, offline reallocation counts, and proba- 
tional counts) have a large impact on failure proba- 
bility. 


e Given the lack of occurrence of predictive SMART 
signals on a large fraction of failed drives, it is un- 
likely that an accurate predictive failure model can 
be built based on these signals alone. 


2 Background 


In this section we describe the infrastructure that was 
used to gather and process the data used in this study, 
the types of disk drives included in the analysis, and in- 
formation on how they are deployed. 


2.1 The System Health Infrastructure 


The System Health infrastructure is a large distributed 
software system that collects and stores hundreds of 
attribute-value pairs from all of Google’s servers, and 
provides the interface for arbitrary analysis jobs to pro- 
cess that data. 


The architecture of the System Health infrastructure 
is shown in Figure 1. It consists of a data collection 
layer, a distributed repository and an analysis frame- 
work. The collection layer is responsible for getting in- 
formation from each of thousands of individual servers 
into a centralized repository. Different flavors of col- 
lectors exist to gather different types of data. Much of 
the health information is obtained from the machines di- 
rectly. A daemon runs on every machine and gathers 
local data related to that machine’s health, such as envi- 
ronmental parameters, utilization information of various 
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Figure 1: Collection, storage, and analysis architecture. 


resources, error indications, and configuration informa- 
tion. It is imperative that this daemon’s resource usage 
be very light, so not to interfere with the applications. 
One way to assure this is to have the machine-level col- 
lector poll individual machines relatively infrequently 
(every few minutes). Other slower changing data (such 
as configuration information) and data from other exist- 
ing databases can be collected even less frequently than 
that. Most notably for this study, data regarding ma- 
chine repairs and disk swaps are pulled in from another 
database. 

The System Health database is built upon Bigtable 
[3], a distributed data repository widely used within 
Google, which itself is built upon the Google File Sys- 
tem (GFS) [8]. Bigtable takes care of all the data layout, 
compression, and access chores associated with a large 
data store. It presents the abstraction of a 2-dimensional 
table of data cells, with different versions over time mak- 
ing up a third dimension. It is a natural fit for keeping 
track of the values of different variables (columns) for 
different machines (rows) over time. The System Health 
database thus retains a complete time-ordered history of 
the environment, utilization, error, configuration, and re- 
pair events in each machine’s life. 

Analysis programs run on top of the System Health 
database, looking at information from individual ma- 
chines, or mining the data across thousands of machines. 
Large-scale analysis programs are typically built upon 
Google’s Mapreduce [5] framework. Mapreduce auto- 
mates the mechanisms of large-scale distributed compu- 


tation (such as work distribution, load balancing, toler- 
ance of failures), allowing the user to focus simply on 
the algorithms that make up the heart of the computa- 
tion. 


The analysis pipeline used for this study consists of 
a Mapreduce job written in the Sawzall language and 
framework [15] to extract and clean up periodic SMART 
data and repair data related to disks, followed by a pass 
through R [1] for statistical analysis and final graph gen- 
eration. 


2.2 Deployment Details 


The data in this study are collected from a large num- 
ber of disk drives, deployed in several types of systems 
across all of Google’s services. More than one hundred 
thousand disk drives were used for all the results pre- 
sented here. The disks are a combination of serial and 
parallel ATA consumer-grade hard disk drives, ranging 
in speed from 5400 to 7200 rpm, and in size from 80 to 
400 GB. All units in this study were put into production 
in or after 2001. The population contains several models 
from many of the largest disk drive manufacturers and 
from at least nine different models. The data used for 
this study were collected between December 2005 and 
August 2006. 


As is common in server-class deployments, the disks 
were powered on, spinning, and generally in service for 
essentially all of their recorded life. They were deployed 
in rack-mounted servers and housed in professionally- 
managed datacenter facilities. 


Before being put into production, all disk drives go 
through a short burn-in process, which consists of a 
combination of read/write stress tests designed to catch 
many of the most common assembly, configuration, or 
component-level problems. The data shown here do not 
include the fall-out from this phase, but instead begin 
when the systems are officially commissioned for use. 
Therefore our data should be consistent with what a reg- 
ular end-user should see, since most equipment manu- 
facturers put their systems through similar tests before 
shipment. 


2.3 Data Preparation 


Definition of Failure. Narrowly defining what consti- 
tutes a failure is a difficult task in such a large opera- 
tion. Manufacturers and end-users often see different 
statistics when computing failures since they use differ- 
ent definitions for it. While drive manufacturers often 
quote yearly failure rates below 2% [2], user studies have 
seen rates as high as 6% [9]. Elerath and Shah [7] report 
between 15-60% of drives considered to have failed at 


the user site are found to have no defect by the manu- 
facturers upon returning the unit. Hughes et al. [11] ob- 
serve between 20-30% “no problem found” cases after 
analyzing failed drives from their study of 3477 disks. 


From an end-user’s perspective, a defective drive is 
one that misbehaves in a serious or consistent enough 
manner in the user’s specific deployment scenario that 
it is no longer suitable for service. Since failures are 
sometimes the result of a combination of components 
(i.e., a particular drive with a particular controller or ca- 
ble, etc), it is no surprise that a good number of drives 
that fail for a given user could be still considered op- 
erational in a different test harness. We have observed 
that phenomenon ourselves, including situations where 
a drive tester consistently “green lights” a unit that in- 
variably fails in the field. Therefore, the most accurate 
definition we can present of a failure event for our study 
is: a drive is considered to have failed if it was replaced 
as part of a repairs procedure. Note that this definition 
implicitly excludes drives that were replaced due to an 
upgrade. 


Since it is not always clear when exactly a drive failed, 
we consider the time of failure to be when the drive was 
replaced, which can sometimes be a few days after the 
observed failure event. It is also important to mention 
that the parameters we use in this study were not in use 
as part of the repairs diagnostics procedure at the time 
that these data were collected. Therefore there is no risk 
of false (forced) correlations between these signals and 
repair outcomes. 


Filtering. With such a large number of units monitored 
over a long period of time, data integrity issues invari- 
ably show up. Information can be lost or corrupted along 
our collection pipeline. Therefore, some cleaning up of 
the data is necessary. In the case of missing values, the 
individual values are marked as not available and that 
specific piece of data is excluded from the detailed stud- 
ies. Other records for that same drive are not discarded. 


In cases where the data are clearly spurious, the entire 
record for the drive is removed, under the assumption 
that one piece of spurious data draws into question other 
fields for the same drive. Identifying spurious data, how- 
ever, is a tricky task. Because part of the goal of studying 
the data is to learn what the numbers mean, we must be 
careful not to discard too much data that might appear 
invalid. So we define spurious simply as negative counts 
or data values that are clearly impossible. For exam- 
ple, some drives have reported temperatures that were 
hotter than the surface of the sun. Others have had neg- 
ative power cycles. These were deemed spurious and 
removed. On the other hand, we have not filtered any 
suspiciously large counts from the SMART signals, un- 
der the hypothesis that large counts, while improbable as 


raw numbers, are likely to be good indicators of some- 
thing really bad with the drive. Filtering for spurious 
values reduced the sample set size by less than 0.1%. 


3 Results 


We now analyze the failure behavior of our fleet of disk 
drives using detailed monitoring data collected over a 
nine-month observation window. During this time we 
recorded failure events as well as all the available en- 
vironmental and activity data and most of the SMART 
parameters from the drives themselves. Failure informa- 
tion spanning a much longer interval (approximately five 
years) was also mined from an older repairs database. 
All the results presented here were tested for their statis- 
tical significance using the appropriate tests. 


3.1 Baseline Failure Rates 


Figure 2 presents the average Annualized Failure Rates 
(AFR) for all drives in our study, aged zero to 5 years, 
and is derived from our older repairs database. The data 
are broken down by the age a drive was when it failed. 
Note that this implies some overlap between the sample 
sets for the 3-month, 6-month, and 1-year ages, because 
a drive can reach its 3-month, 6-month and 1-year age 
all within the observation period. Beyond 1-year there is 
no more overlap. 

While it may be tempting to read this graph as strictly 
failure rate with drive age, drive model factors are 
strongly mixed into these data as well. We tend to source 
a particular drive model only for a limited time (as new, 
more cost-effective models are constantly being intro- 
duced), so it is often the case that when we look at sets 
of drives of different ages we are also looking at a very 
different mix of models. Consequently, these data are 
not directly useful in understanding the effects of disk 
age on failure rates (the exception being the first three 
data points, which are dominated by a relatively stable 
mix of disk drive models). The graph is nevertheless a 
good way to provide a baseline characterization of fail- 
ures across our population. It is also useful for later 
studies in the paper, where we can judge how consistent 
the impact of a given parameter is across these diverse 
drive model groups. A consistent and noticeable impact 
across all groups indicates strongly that the signal being 
measured has a fundamentally powerful correlation with 
failures, given that it is observed across widely varying 
ages and models. 

The observed range of AFRs (see Figure 2) varies 
from 1.7%, for drives that were in their first year of op- 
eration, to over 8.6%, observed in the 3-year old pop- 
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Figure 2: Annualized failure rates broken down by age groups 


ulation. The higher baseline AFR for 3 and 4 year old 
drives is more strongly influenced by the underlying re- 
liability of the particular models in that vintage than by 
disk drive aging effects. It is interesting to note that our 
3-month, 6-months and 1-year data points do seem to 
indicate a noticeable influence of infant mortality phe- 
nomena, with 1-year AFR dropping significantly from 
the AFR observed in the first three months. 


3.2 Manufacturers, Models, and Vintages 


Failure rates are known to be highly correlated with drive 
models, manufacturers and vintages [18]. Our results do 
not contradict this fact. For example, Figure 2 changes 
significantly when we normalize failure rates per each 
drive model. Most age-related results are impacted by 
drive vintages. However, in this paper, we do not show a 
breakdown of drives per manufacturer, model, or vintage 
due to the proprietary nature of these data. 

Interestingly, this does not change our conclusions. In 
contrast to age-related results, we note that all results 
shown in the rest of the paper are not affected signifi- 
cantly by the population mix. None of our SMART data 
results change significantly when normalized by drive 
model. The only exception is seek error rate, which is 
dependent on one specific drive manufacturer, as we dis- 
cuss in section 3.5.5. 


3.3 Utilization 


The literature generally refers to utilization metrics by 
employing the term duty cycle which unfortunately has 
no consistent and precise definition, but can be roughly 
characterized as the fraction of time a drive is active out 
of the total powered-on time. What is widely reported in 
the literature is that higher duty cycles affect disk drives 
negatively [4, 21]. 


It is difficult for us to arrive at a meaningful numer- 
ical utilization metric given that our measurements do 
not provide enough detail to derive what 100% utiliza- 
tion might be for any given disk model. We choose in- 
stead to measure utilization in terms of weekly averages 
of read/write bandwidth per drive. We categorize utiliza- 
tion in three levels: low, medium and high, correspond- 
ing respectively to the lowest 25th percentile, 50-75th 
percentiles and top 75th percentile. This categorization 
is performed for each drive model, since the maximum 
bandwidths have significant variability across drive fam- 
ilies. We note that using number of I/O operations and 
bytes transferred as utilization metrics provide very sim- 
ilar results. Figure 3 shows the impact of utilization on 
AFR across the different age groups. 


Overall, we expected to notice a very strong and con- 
sistent correlation between high utilization and higher 
failure rates. However our results appear to paint a more 
complex picture. First, only very young and very old 
age groups appear to show the expected behavior. Af- 
ter the first year, the AFR of high utilization drives is 
at most moderately higher than that of low utilization 
drives. The three-year group in fact appears to have the 
opposite of the expected behavior, with low utilization 
drives having slightly higher failure rates than high uti- 
lization ones. 

One possible explanation for this behavior is the sur- 
vival of the fittest theory. It is possible that the fail- 
ure modes that are associated with higher utilization are 
more prominent early in the drive’s lifetime. If that is the 
case, the drives that survive the infant mortality phase 
are the least susceptible to that failure mode, and result 
in a population that is more robust with respect to varia- 
tions in utilization levels. 


Another possible explanation is that previous obser- 
vations of high correlation between utilization and fail- 
ures has been based on extrapolations from manufactur- 
ers’ accelerated life experiments. Those experiments are 
likely to better model early life failure characteristics, 
and as such they agree with the trend we observe for the 
young age groups. It is possible, however, that longer 
term population studies could uncover a less pronounced 
effect later in a drive’s lifetime. 


When we look at these results across individual mod- 
els we again see a complex pattern, with varying pat- 
terns of failure behavior across the three utilization lev- 
els. Taken as a whole, our data indicate a much weaker 
correlation between utilization levels and failures than 
previous work has suggested. 
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Figure 3: Utilization AFR 


3.4 Temperature 


Temperature is often quoted as the most important envi- 
ronmental factor affecting disk drive reliability. Previous 
studies have indicated that temperature deltas as low as 
15C can nearly double disk drive failure rates [4]. Here 
we take temperature readings from the SMART records 
every few minutes during the entire 9-month window 
of observation and try to understand the correlation be- 
tween temperature levels and failure rates. 


We have aggregated temperature readings in several 
different ways, including averages, maxima, fraction of 
time spent above a given temperature value, number of 
times a temperature threshold is crossed, and last tem- 
perature before failure. Here we report data on averages 
and note that other aggregation forms have shown sim- 
ilar trends and and therefore suggest the same conclu- 
sions. 


We first look at the correlation between average tem- 
perature during the observation period and failure. Fig- 
ure 4 shows the distribution of drives with average tem- 
perature in increments of one degree and the correspond- 
ing annualized failure rates. The figure shows that fail- 
ures do not increase when the average temperature in- 
creases. In fact, there is a clear trend showing that lower 
temperatures are associated with higher failure rates. 
Only at very high temperatures is there a slight reversal 
of this trend. 


Figure 5 looks at the average temperatures for differ- 
ent age groups. The distributions are in sync with Figure 
4 showing a mostly flat failure rate at mid-range temper- 
atures and a modest increase at the low end of the tem- 
perature distribution. What stands out are the 3 and 4- 
year old drives, where the trend for higher failures with 
higher temperature is much more constant and also more 
pronounced. 


Overall our experiments can confirm previously re- 
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Figure 4: Distribution of average temperatures and failures 
rates. 
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Figure 5: AFR for average drive temperature. 


ported temperature effects only for the high end of our 
temperature range and especially for older drives. In the 
lower and middle temperature ranges, higher tempera- 
tures are not associated with higher failure rates. This is 
a fairly surprising result, which could indicate that data- 
center or server designers have more freedom than pre- 
viously thought when setting operating temperatures for 
equipment that contains disk drives. We can conclude 
that at moderate temperature ranges it is likely that there 
are other effects which affect failure rates much more 
strongly than temperatures do. 


3.5 SMART Data Analysis 


We now look at the various self-monitoring signals that 
are available from virtually all of our disk drives through 
the SMART standard interface. Our analysis indicates 
that some signals appear to be more relevant to the study 
of failures than others. We first look at those in detail, 
and then list a summary of our findings for the remaining 


ones. At the end of this section we discuss our results 
and reason about the usefulness of SMART parameters 
in obtaining predictive models for individual disk drive 
failures. 


We present results in three forms. First we compare 
the AFR of drives with zero and non-zero counts for a 
given parameter, broken down by the same age groups 
as in figures 2 and 3. We also find it useful to plot the 
probability of survival of drives over the nine-month ob- 
servation window for different ranges of parameter val- 
ues. Finally, in addition to the graphs, we devise a sin- 
gle metric that could relay how relevant the values of 
a given SMART parameter are in predicting imminent 
failures. To that end, for each SMART parameter we 
look for thresholds that increased the probability of fail- 
ure in the next 60 days by at least a factor of 10 with 
respect to drives that have zero counts for that parame- 
ter. We report such Critical Thresholds whenever we are 
able to find them with high confidence (> 95%). 


3.5.1 Scan Errors 


Drives typically scan the disk surface in the background 
and report errors as they discover them. Large scan error 
counts can be indicative of surface defects, and therefore 
are believed to be indicative of lower reliability. In our 
population, fewer than 2% of the drives show scan errors 
and they are nearly uniformly spread across various disk 
models. 


Figure 6 shows the AFR values of two groups of 
drives, those without scan errors and those with one or 
more. We plot bars across all age groups in which we 
have statistically significant data. We find that the group 
of drives with scan errors are ten times more likely to fail 
than the group with no errors. This effect is also noticed 
when we further break down the groups by disk model. 


From Figure 8 we see a drastic and quick decrease in 
survival probability after the first scan error (left graph). 
A little over 70% of the drives survive the first 8 months 
after their first scan error. The dashed lines represent the 
95% confidence interval. The middle plot in Figure 8 
separates the population in four age groups (in months), 
and shows an effect that is not visible in the AFR plots. It 
appears that scan errors affect the survival probability of 
young drives more dramatically very soon after the first 
scan error occurs, but after the first month the curve flat- 
tens out. Older drives, however, continue to see a steady 
decline in survival probability throughout the 8-month 
period. This behavior could be another manifestation of 
infant mortality phenomenon. The right graph in figure 8 
looks at the effect of multiple scan errors. While drives 
with one error are more likely to fail than those with 
none, drives with multiple errors fail even more quickly. 
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Figure 8: Impact of scan errors on survival probability. Left figure shows aggregate survival probability for all drives after first 
scan error. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their 


number of scan errors. 


The critical threshold analysis confirms what the 
charts visually imply: the critical threshold for scan er- 
rors is one. After the first scan error, drives are 39 times 
more likely to fail within 60 days than drives without 
scan errors. 


3.5.2 Reallocation Counts 


When the drive’s logic believes that a sector is damaged 
(typically as a result of recurring soft errors or a hard er- 
ror) it can remap the faulty sector number to a new phys- 
ical sector drawn from a pool of spares. Reallocation 
counts reflect the number of times this has happened, 
and is seen as an indication of drive surface wear. About 
9% of our population has reallocation counts greater 
than zero. Although some of our drive models show 
higher absolute values than others, the trends we observe 
are similar across all models. 


As with scan errors, the presence of reallocations 
seems to have a consistent impact on AFR for all age 


groups (Figure 7), even if slightly less pronounced. 
Drives with one or more reallocations do fail more of- 
ten than those with none. The average impact on AFR 
appears to be between a factor of 3-6x. 

Figure 11 shows the survival probability after the first 
reallocation. We truncate the graph to 8.5 months, due 
to a drastic decrease in the confidence levels after that 
point. In general, the left graph shows, about 85% of the 
drives survive past 8 months after the first reallocation. 
The effect is more pronounced (middle graph) for drives 
in the age ranges [10,20) and [20, 60] months, while 
newer drives in the range [0,5) months suffer more than 
their next generation. This could again be due to infant 
mortality effects, although it appears to be less drastic in 
this case than for scan errors. 

After their first reallocation, drives are over 14 times 
more likely to fail within 60 days than drives without 
reallocation counts, making the critical threshold for this 
parameter also one. 
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Figure 11: Impact of reallocation count values on survival probability. Left figure shows aggregate survival probability for all 
drives after first reallocation. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down 


drives by their number of reallocations. 


3.5.3 Offline Reallocations 


Offline reallocations are defined as a subset of the real- 
location counts studied previously, in which only real- 
located sectors found during background scrubbing are 
counted. In other words, it should exclude sectors that 
are reallocated as a result of errors found during actual 
I/O operations. Although this definition mostly holds, 
we see evidence that certain disk models do not imple- 
ment this definition. For instance, some models show 
more offline reallocations than total reallocations. Since 
the impact of offline reallocations appears to be signif- 
icant and not identical to that of total reallocations, we 
decided to present it separately (Figure 9). About 4% of 
our population shows non-zero values for offline reallo- 
cations, and they tend to be concentrated on a particular 
subset of drive models. 

Overall, the effects on survival probability of offline 
reallocation seem to be more drastic than those of to- 
tal reallocations, as seen in Figure 12 (as before, some 
curves are clipped at 8 months because our data for those 


points were not within high confidence intervals). Drives 
in the older age groups appear to be more highly affected 
by it, although we are unable to attribute this effect to 
age given the different model mixes in the various age 
groups. 

After the first offline reallocation, drives have over 
21 times higher chances of failure within 60 days than 
drives without offline reallocations; an effect that is 
again more drastic than total reallocations. 


Our data suggest that, although offline reallocations 
could be an important parameter affecting failures, it is 
particularly important to interpret trends in these values 
within specific models, since there is some evidence that 
different drive models may classify reallocations differ- 
ently. 


3.5.4 Probational Counts 


Disk drives put suspect bad sectors “on probation” un- 
til they either fail permanently and are reallocated or 
continue to work without problems. Probational counts, 
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Figure 12: Impact of offline reallocation on survival probability. Left figure shows aggregate survival probability for all drives 
after first offline reallocation. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down 


drives by their number offline reallocation. 
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Figure 13: Impact of probational count values on survival probability. Left figure shows aggregate survival probability for all 
drives after first probational count. Middle figure breaks down survival probability per drive ages in months. Right figure breaks 


down drives by their number of probational counts. 


therefore, can be seen as a softer error indication. It 
could provide earlier warning of possible problems but 
might also be a weaker signal, in that sectors on pro- 
bation may indeed never be reallocated. About 2% of 
our drives had non-zero probational count values. We 
note that this number is lower than both online and of- 
fline reallocation counts, likely indicating that sectors 
may be removed from probation after further observa- 
tion of their behavior. Once more, the distribution of 
drives with non-zero probational counts are somewhat 
skewed towards a subset of disk drive models. 


Figures 10 and 13 show that probational count trends 
are generally similar to those observed for offline re- 
allocations, with age group being somewhat less pro- 
nounced. The critical threshold for probational counts 
is also one: after the first event, drives are 16 times more 
likely to fail within 60 days than drives with zero proba- 
tional counts. 


3.5.5 Miscellaneous Signals 


In addition to the SMART parameters described in the 
previous sections, which we have found to most closely 
impact failure rates, we have also studied several other 
parameters from the SMART set as well as other envi- 
ronmental factors. Here we briefly mention our relevant 
findings for some of those parameters. 


Seek Errors. Seek errors occur when a disk drive fails to 
properly track a sector and needs to wait for another rev- 
olution to read or write from or to a sector. Drives report 
it as arate, and it is meant to be used in combination with 
model-specific thresholds. When examining our popu- 
lation, we find that seek errors are widespread within 
drives of one manufacturer only, while others are more 
conservative in showing this kind of errors. For this one 
manufacturer, the trend in seek errors is not clear, chang- 
ing from one vintage to another. For other manufactur- 
ers, there is no correlation between failure rates and seek 
errors. 


CRC Errors. Cyclic redundancy check (CRC) errors 


are detected during data transmission between the phys- 
ical media and the interface. Although we do observe 
some correlation between higher CRC counts and fail- 
ures, those effects are somewhat less pronounced. CRC 
errors are less indicative of drive failures than that of ca- 
bles and connectors. About 2% of our population had 
CRC errors. 


Power Cycles. The power cycles indicator counts the 
number of times a drive is powered up and down. In 
a server-class deployment, in which drives are powered 
continuously, we do not expect to reach high enough 
power cycle counts to see any effects on failure rates. 
Our results find that for drives aged up to two years, this 
is true, there is no significant correlation between fail- 
ures and high power cycles count. But for drives 3 years 
and older, higher power cycle counts can increase the 
absolute failure rate by over 2%. We believe this is due 
more to our population mix than to aging effects. More- 
over, this correlation could be the effect (not the cause) 
of troubled machines that require many repair iterations 
and thus many power cycles to be fixed. 


Calibration Retries. We were unable to reach a consis- 
tent and clear definition of this SMART parameter from 
public documents as well as consultations with some of 
the disk manufacturers. Nevertheless, our observations 
do not indicate that this is a particularly useful parame- 
ter for the goals of this study. Under 0.3% of our drives 
have calibration retries, and of that group only about 2% 
have failed, making this a very weak and imprecise sig- 
nal when compared with other SMART parameters. 


Spin Retries. Counts the number of retries when the 
drive is attempting to spin up. We did not register a sin- 
gle count within our entire population. 


Power-on hours Although we do not dispute that 
power-on hours might have an effect on drive lifetime, 
it happens that in our deployment the age of the drive is 
an excellent approximation for that parameter, given that 
our drives remain powered on for most of their life time. 


Vibration This is not a parameter that is part of the 
SMART set, but it is one that is of general concern in de- 
signing drive enclosures as most manufacturers describe 
how vibration can affect both performance and reliabil- 
ity of disk drives. Unfortunately we do not have sensor 
information to measure this effect directly for drives in 
service. We attempted to indirectly infer vibration ef- 
fects by considering the differences in failure rates be- 
tween systems with a single drive and those with mul- 
tiple drives, but those experiments were not controlled 
enough for other possible factors to allow us to reach 
any conclusions. 


3.5.6 Predictive Power of SMART Parameters 


Given how strongly correlated some SMART parame- 
ters were found to be with higher failure rates, we were 
hopeful that accurate predictive failure models based on 
SMART signals could be created. Predictive models are 
very useful in that they can reduce service disruption 
due to failed components and allow for more efficient 
scheduled maintenance processes to replace the less ef- 
ficient (and reactive) repairs procedures. In fact, one of 
the main motivations for SMART was to provide enough 
insight into disk drive behavior to enable such models to 
be built. 


After our initial attempts to derive such models 
yielded relatively unimpressive results, we turned to the 
question of what might be the upper bound of the accu- 
racy of any model based solely on SMART parameters. 
Our results are surprising, if not somewhat disappoint- 
ing. Out of all failed drives, over 56% of them have no 
count in any of the four strong SMART signals, namely 
scan errors, reallocation count, offline reallocation, and 
probational count. In other words, models based only 
on those signals can never predict more than half of the 
failed drives. Figure 14 shows that even when we add 
all remaining SMART parameters (except temperature) 
we still find that over 36% of all failed drives had zero 
counts on all variables. This population includes seek 
error rates, which we have observed to be widespread in 
our population (> 72% of our drives have it) which fur- 
ther reduces the sample size of drives without any errors. 


It is difficult to add temperature to this analysis since 
despite it being reported as part of SMART there are no 
crisp thresholds that directly indicate errors. However, 
if we arbitrarily assume that spending more than 50% 
of the observed time above 40C is an indication of pos- 
sible problem, and add those drives to the set of pre- 
dictable failures, we still are left with about 36% of all 
drives with no failure signals at all. Actual useful mod- 
els, which need to have small false-positive rates are in 
fact likely to do much worse than these limits might sug- 
gest. 


We conclude that it is unlikely that SMART data alone 
can be effectively used to build models that predict fail- 
ures of individual drives. SMART parameters still ap- 
pear to be useful in reasoning about the aggregate reli- 
ability of large disk populations, which is still very im- 
portant for logistics and supply-chain planning. It is pos- 
sible, however, that models that use parameters beyond 
those provided by SMART could achieve significantly 
better accuracies. For example, performance anomalies 
and other application or operating system signals could 
be useful in conjunction with SMART data to create 
more powerful models. We plan to explore this possi- 
bility in our future work. 
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Figure 14: Percentage of failed drives with SMART errors. 
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Previous studies in this area generally fall into two cat- 
egories: vendor (disk drive or storage appliance) tech- 
nical papers and user experience studies. Disk ven- 
dors studies provide valuable insight into the electro- 
mechanical characteristics of disks and both model- 
based and experimental data that suggests how several 
environmental factors and usage activities can affect de- 
vice lifetime. Yang and Sun [21] and Cole [4] describe 
the processes and experimental setup used by Quantum 
and Seagate to test new units and the models that attempt 
to make long-term reliability predictions based on accel- 
erated life tests of small populations. Power-on-hours, 
duty cycle, temperature are identified as the key deploy- 
ment parameters that impact failure rates, each of them 
having the potential to double failure rates when going 
from nominal to extreme values. For example, Cole 
presents thermal de-rating models showing that MTBF 
could degrade by as much as 50% when going from op- 
erating temperatures of 30C to 40C. Cole’s report also 
presents yearly failure rates from Seagate’s warranty 
database, indicating a linear decrease in annual failure 
rates from 1.2% in the first year to 0.39% in the third 
(and last year of record). In our study, we did not find 
much correlation between failure rate and either elevated 
temperature or utilization. It is the most surprising result 
of our study. Our annualized failure rates were generally 
higher than those reported by vendors, and more consis- 
tent with other user experience studies. 

Shah and Elerath have written several papers based 
on the behavior of disk drives inside Network Appli- 
ance storage products [6, 7, 19]. They use a reliability 
database that includes field failure statistics as well as 
support logs, and their position as an appliance vendor 
enables them more control and visibility into actual de- 


ployments than a typical disk drive vendor might have. 
Although they do not report directly on the correlation 
between SMART parameters or environmental factors 
and failures (possibly for confidentiality concerns), their 
work is useful in enabling a qualitative understanding 
of factors what affect disk drive reliability. For exam- 
ple, they comment that end-user failure rates can be as 
much as ten times higher than what the drive manufac- 
turer might expect [7]; they report in [6] a strong experi- 
mental correlation between number of heads and higher 
failure rates (an effect that is also predicted by the mod- 
els in [4]); and they observe that different failure mech- 
anisms are at play at different phases of a drive lifetime 
[19]. Generally, our findings are in line with these re- 
sults. 


User experience studies may lack the depth of insight 
into the device inner workings that is possible in man- 
ufacturer reports, but they are essential in understand- 
ing device behavior in real-world deployments. Unfortu- 
nately, there are very few such studies to date, probably 
due to the large number of devices needed to observe 
statistically significant results and the complex infras- 
tructure required to track failures and their contributing 
factors. 


Talagala and Patterson [20] perform a detailed er- 
ror analysis of 368 SCSI disk drives over an eighteen 
month period, reporting a failure rate of 1.9%. Re- 
sults on a larger number of desktop-class ATA drives 
under deployment at the Internet Archive are presented 
by Schwarz et al [17]. They report on a 2% failure rate 
for a population of 2489 disks during 2005, while men- 
tioning that replacement rates have been as high as 6% 
in the past. Gray and van Ingen [9] cite observed fail- 
ure rates ranging from 3.3-6% in two large web prop- 
erties with 22,400 and 15,805 disks respectively. A re- 
cent study by Schroeder and Gibson [16] helps shed light 
into the statistical properties of disk drive failures. The 
study uses failure data from several large scale deploy- 
ments, including a large number of SATA drives. They 
report a significant overestimation of mean time to fail- 
ure by manufacturers and a lack of infant mortality ef- 
fects. None of these user studies have attempted to cor- 
relate failures with SMART parameters or other environ- 
mental factors. 

We are aware of two groups that have attempted 
to correlate SMART parameters with failure statistics. 
Hughes et al [11, 13, 14] and Hamerly and Elkan [10]. 
The largest populations studied by these groups was of 
3744 and 1934 drives and they derive failure models that 
achieve predictive rates as high as 30%, at false posi- 
tive rates of about 0.2% (that false-positive rate corre- 
sponded to a number of drives between 20-43% of the 
drives that actually failed in their studies). Hughes et al. 


also cites an annualized failure rate of 4-6%, based on 
their 2-3 month long experiment which appears to use 
stress test logs provided by a disk manufacturer. 


Our study takes a next step towards a better under- 
standing of disk drive failure characteristics by essen- 
tially combining some of the best characteristics of stud- 
ies from vendor database analysis, namely population 
size, with the kind of visibility into a real-world deploy- 
ment that is only possible with end-user data. 


5 Conclusions 


In this study we report on the failure characteristics of 
consumer-grade disk drives. To our knowledge, the 
study is unprecedented in that it uses a much larger 
population size than has been previously reported and 
presents a comprehensive analysis of the correlation be- 
tween failures and several parameters that are believed to 
affect disk lifetime. Such analysis is made possible by 
a new highly parallel health data collection and analysis 
infrastructure, and by the sheer size of our computing 
deployment. 


One of our key findings has been the lack of a con- 
sistent pattern of higher failure rates for higher temper- 
ature drives or for those drives at higher utilization lev- 
els. Such correlations have been repeatedly highlighted 
by previous studies, but we are unable to confirm them 
by observing our population. Although our data do not 
allow us to conclude that there is no such correlation, 
it provides strong evidence to suggest that other effects 
may be more prominent in affecting disk drive reliabil- 
ity in the context of a professionally managed data center 
deployment. 


Our results confirm the findings of previous smaller 
population studies that suggest that some of the SMART 
parameters are well-correlated with higher failure prob- 
abilities. We find, for example, that after their first scan 
error, drives are 39 times more likely to fail within 60 
days than drives with no such errors. First errors in re- 
allocations, offline reallocations, and probational counts 
are also strongly correlated to higher failure probabil- 
ities. Despite those strong correlations, we find that 
failure prediction models based on SMART parameters 
alone are likely to be severely limited in their prediction 
accuracy, given that a large fraction of our failed drives 
have shown no SMART error signals whatsoever. This 
result suggests that SMART models are more useful in 
predicting trends for large aggregate populations than for 
individual components. It also suggests that powerful 
predictive models need to make use of signals beyond 
those provided by SMART. 
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